By: Sunda Gerard
knitr::opts_chunk$set(echo = FALSE)
# Importing the libraries
library(ggplot2)
library(ggridges)
library(reshape2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(GGally)
library(memisc)
library(Hmisc)
library(pander)
library(corrplot)
#Importing the data into RStudio
df <- read.csv('wineQualityReds.csv')
Abstract
This project explores data on red wine, focusing on the chemical properties that may affect the quality of the wine. The data is imported into RStudio for exploratory data analysis.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
A few observations on the data so far:
There are 13 variables and 1599 observations.
X is an index variable that identifies the individual red wines. Since it does not contribute to the analysis, we will remove it.
Of the 12 remaining variables, quality is the outcome: the rating of how good the wine is. The other 11 variables are chemical properties that may affect that rating.
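Removing the index column could be sketched in base R as follows. The helper name drop_index_column is hypothetical, and the three-row data frame is only a cut-down stand-in built from the first values printed by str above; in the report the same call would be applied to df.

```r
# Hypothetical helper: return the data without the row-index column.
drop_index_column <- function(data, index_name = "X") {
  data[, setdiff(names(data), index_name), drop = FALSE]
}

# Stand-in data: first three rows of X, alcohol and quality from the str() output.
wine_head <- data.frame(
  X = 1:3,
  alcohol = c(9.4, 9.8, 9.8),
  quality = c(5L, 5L, 5L)
)

wine_clean <- drop_index_column(wine_head)
names(wine_clean)
```

Using setdiff on the column names keeps the helper safe to call twice: if X has already been removed, the data comes back unchanged.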
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Now that the X variable is removed, we can focus on analyzing the remaining variables in the dataset.
First, I want to see what each variable looks like graphed. I will create individual histograms of each variable.
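One way to build a histogram per variable is sketched below in base R. This is an assumption about the approach, not the code behind the plots in this report, which loads ggplot2. The data frame holds only the first ten rows of three columns, copied from the str output above.

```r
# Stand-in data: first ten rows of three columns from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One histogram object per numeric column.
# plot = FALSE computes the bins and counts without drawing anything.
numeric_cols <- names(wine_head)[sapply(wine_head, is.numeric)]
histograms <- lapply(wine_head[numeric_cols], hist, plot = FALSE)

# To draw the histograms side by side instead:
# par(mfrow = c(1, length(numeric_cols)))
# for (v in numeric_cols) hist(wine_head[[v]], main = v, xlab = v)
```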
Looking at the histograms, we can see the following types of distributions:
Normal: volatile.acidity, density, and pH are approximately normally distributed.
Positively skewed: fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol all have a long right tail.
Outliers: several variables have extreme outliers, including sulphates, total.sulfur.dioxide, chlorides, and residual.sugar.
Quality: most of the wines are rated 5 or 6. This suggests that there are few terrible-tasting red wines in this dataset, and only a few great-tasting ones.
To bring the positively skewed distributions closer to a normal distribution, we can apply the sqrt function to those variables.
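The effect of the square-root transformation can be checked numerically. The sketch below uses a hand-written, moment-based skewness helper (an assumption; no such function is used in this report) and the first ten total.sulfur.dioxide values from the str output above. Ten values are far too few to draw conclusions from, so this only illustrates the mechanics.

```r
# Moment-based sample skewness: about 0 for a symmetric sample,
# positive when the right tail is longer.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# First ten total.sulfur.dioxide values from the str() output.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

skew_raw  <- skewness(total_so2)
skew_sqrt <- skewness(sqrt(total_so2))
c(raw = skew_raw, sqrt = skew_sqrt)
```

A square root compresses large values more than small ones, so it shortens a right tail. It does not guarantee a normal distribution, so it is worth re-plotting the histogram after transforming.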
Univariate Analysis
There are 12 variables, with 1,599 observations.
The main interest in the dataset is which variables positively or negatively affect the quality of red wine.
I believe that factors including, but not limited to, alcohol, density, residual sugar, and pH will make a difference in the quality of the wine.
We transformed a few variables that previously had skewed distributions.
The fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol variables were positively skewed and were later converted into a more normal distribution.
Sulphates, total.sulfur.dioxide, chlorides, and residual.sugar all had extreme outliers.
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I applied the sqrt function to the fixed.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, and alcohol variables to bring them closer to a normal distribution.
I removed the X variable, which was used for indexing in the dataset.
The first relationship I am interested in determining is between free.sulfur.dioxide and total.sulfur.dioxide. Not knowing much about either or how it translates into making wine, I am interested in seeing if the variables are correlated in any way.
The plot shows a positive correlation between free.sulfur.dioxide and total.sulfur.dioxide. As one increases, so does the other in most cases.
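The visual impression can be backed with a Pearson correlation coefficient. The sketch below uses the base R cor function on the first ten rows printed by str above; the helper name pair_correlation is hypothetical, and the value for the full dataset of 1599 wines will differ.

```r
# Hypothetical helper: Pearson correlation between two named columns.
pair_correlation <- function(data, x, y) {
  cor(data[[x]], data[[y]], method = "pearson", use = "complete.obs")
}

# Stand-in data: first ten rows from the str() output.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

r <- pair_correlation(wine_head, "free.sulfur.dioxide", "total.sulfur.dioxide")
r
```

A value near 1 indicates that the two measures rise together, a value near 0 indicates no linear relationship, and a negative value indicates that one falls as the other rises.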
The next relationship I wanted to examine was between citric.acid and quality. It would seem that the freshness and flavor that citric.acid brings would make the wine taste better. There is no strong correlation apparent between citric.acid and the quality of the wine; however, most wines have a citric.acid level below 0.5 and a quality between 4.5 and 6.5. A small relationship is present, but not enough to say definitively that citric.acid strongly affects the quality of wine.
I am now going to see if residual.sugar has any strong effect on the quality of wine. Taste preferences can vary for sweet or dry wine, so I believe that this will have no strong correlation to quality.
Surprise! The residual.sugar level of most of the wines is already less than 4, but they have varying levels of quality. This may suggest that dry wines are typically rated higher than sweet ones, but it may also simply reflect the fact that most of the wines are not sweet to begin with. As with citric.acid, the quality of the wines is mostly between 4.5 and 6.5, with a cluster around 7. This relationship needs more examination.
Next, I want to examine chlorides, the amount of salt in the wine, and see if there is any relationship to the quality of the wine.
It looks like there is some correlation between the amount of chlorides in wine and the quality of the wine. The less chloride in the wine, the less salty the wine, which seems to mean a better taste. Most of the wines do not have many chlorides though, with a majority below 0.2. This breakdown looks fairly similar to the residual.sugar plot. Since they are similar, I want to compare the two variables on a plot.
There appears to be a lot of correlation between the two variables, with a few outliers. Next, I want to see if there is any relationship between the alcohol content and the quality of wine.
There is a strong correlation here between alcohol content and the quality of wine. As the alcohol content increases, so does the quality in most cases. Wines with an alcohol content below 10 mostly have a quality between 4.5 and 6.5, with a few outliers. As we step up to an alcohol content between 10 and 12, the quality again varies between 4.5 and 6.5, but many wines also fall in the 6.5 to 7.5 range. Most wines with an alcohol content greater than 12 have a quality of 6.5 or greater.
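The three alcohol bands described above can be made explicit with cut, which turns a numeric variable into labelled intervals. This is a sketch on the first ten rows printed by str above, not the code behind the plot; none of those ten wines reaches 12, so the top band is empty here.

```r
# Stand-in data: first ten rows of alcohol and quality from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# right = FALSE makes each interval include its lower bound and
# exclude its upper bound, so a wine at exactly 10 falls in "10 to < 12".
alcohol_band <- cut(
  alcohol,
  breaks = c(-Inf, 10, 12, Inf),
  labels = c("< 10", "10 to < 12", ">= 12"),
  right = FALSE
)

band_counts  <- table(alcohol_band)               # wines per band
mean_quality <- tapply(quality, alcohol_band, mean)  # NA for an empty band
```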
Next, I want to take a look at volatile.acidity to see if there is a relationship between this variable and the quality of wine. High amounts of volatile.acidity usually give wine a strong, pungent vinegar taste.
It becomes apparent that the lower volatile.acidity translates into higher quality wines, whereas higher volatile.acidity results in mostly lower quality wines. This makes sense because people want to drink wine, not drink vinegar.
Next, I want to look at fixed.acidity to see if there is a correlation with wine quality.
There are low amounts of fixed.acidity across the board, with most wines below 12 and a majority below 10. Wine qualities of 5 and 6 are prevalent. We do not see a strong correlation here, but this variable should be investigated further in the multivariate analysis section.
I want to see if density has any effect on wine quality. Although most of the wines have a density between 0.995 and 1, there seems to be no real correlation between the density and the quality of the wine.
Next, I want to investigate if pH has any effect on wine quality. This plot looks very similar to the density plot before this one. Most of the wines are between 3 and 3.5 pH, while the quality is all over the place. I don’t see any correlation from this plot, but we will investigate further shortly.
Lastly, we will look at sulphates and whether there is a correlation with wine quality.
Almost all of the wines have a sulphates level below 1, and the wine quality is concentrated in three spots, at 5, 6, and 7. As the level of sulphates increases, so does the quality in general. There is not enough evidence to say there is a strong correlation, but some relationship is evident.
I observed that alcohol and volatile.acidity had the most telling relationships with the quality of wine. There were also noticeable effects from citric.acid and sulphates. More examination will be needed for both residual.sugar and chlorides, although both were observed to be small factors in determining the quality of wine.
I was surprised that residual.sugar and chlorides were closely correlated.
The strongest relationship I found was that alcohol content greatly affected the quality of red wine.
I want to examine further the four variables that we concluded may have a relationship to the quality of wine: alcohol, fixed.acidity, volatile.acidity, and pH. I wanted to re-examine the relationship between fixed.acidity and the quality of red wine, with an emphasis on the alcohol content. I focused on the alcohol content since we are confident that it is a factor in the quality of wine. The plot shows the expected pattern of higher alcohol content going with higher wine quality. Since the quality is spread across the plot and sulphates are almost all between 0.5 and 1.0, we can say that there is a correlation between sulphates and the quality of red wine, just not one as strong as the relationship between alcohol and quality.
Next, I want to see how volatile.acidity is related to alcohol and the quality of wine.
We can also see in this plot that high quality wines tend to have higher alcohol content. Most higher quality wines have low volatile.acidity levels, which points to a meaningful negative correlation. We will see how strong the correlation is shortly.
I want to see if citric.acid and the alcohol content have strong impacts on the quality of the wine.
We can see from the plot that there is a relationship between the alcohol content and the citric.acid level of wine. Most good quality wines have a citric.acid level between 0.25 and 0.75, so there is a sweet spot, but higher alcohol content continues to go with good wines in general.
I wanted to see if there was a negative relationship between citric.acid and volatile.acidity, similar to the relationship between quality and volatile.acidity.
From this plot, we can see that most wines have between 0 and 0.75 citric.acid, but most higher quality wines are between 0.25 and 0.75. There is a negative correlation between volatile.acidity and citric.acid, similar to that between quality and volatile.acidity. At this point we can say that both citric.acid and volatile.acidity are related to the quality of wine.
The last plot I want to make for the multivariate analysis examines the relationship between pH and the alcohol content of our wines.
The plot shows a positive relationship between alcohol content and the pH of the wines in our dataset. As the alcohol content increases, the pH generally also increases. This is especially true of the wines that scored a 6, 7, or 8 in quality. Since pH is correlated with alcohol, and alcohol is correlated with quality, pH may have some indirect relationship to the quality of wine.
There is a positive correlation between the pH and the alcohol content of wines. Volatile.acidity and citric.acid have a negative correlation. Citric.acid and alcohol content have a positive correlation. Sulphates and quality of wine have a positive correlation. Finally, there is a positive correlation between alcohol content and quality of wine.
I am surprised that some of the variables appear to be related strongly and that some variables are not clearly correlated with quality or with each other.
The first plot will be a number correlation matrix that will examine the relationship between the variables and determine the amount of positive or negative correlation to each other.
As we can see, the matrix clarifies several relationships in the dataset. It also reports each correlation coefficient and whether it is significant at the 0.05 level. The correlation between alcohol content and quality is 0.48, the strongest of the correlations with quality listed here. Other positive correlations with quality include sulphates at 0.25, citric.acid at 0.23, and fixed.acidity at 0.12. Quality is negatively correlated with volatile.acidity at -0.39, followed by total.sulfur.dioxide at -0.19, density at -0.17, and chlorides at -0.13.
Next, I want to examine how the negative correlation of volatile.acidity and the positive correlation of alcohol together relate to wine quality. We will display this on a heat map plot.
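The coefficients and significance checks in such a matrix can be reproduced with the base R functions cor and cor.test. The sketch below is an assumption about the computation, not the code behind the plot, and the helper names are hypothetical. It runs on the first ten rows printed by str above, so its numbers will not match the full-dataset values quoted here.

```r
# Stand-in data: first ten rows of four columns from the str() output.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: Pearson correlation of every predictor with the
# target, sorted from most positive to most negative.
quality_correlations <- function(data, target = "quality") {
  predictors <- setdiff(names(data), target)
  r <- sapply(predictors, function(v) cor(data[[v]], data[[target]]))
  sort(r, decreasing = TRUE)
}

# Hypothetical helper: p-value of each correlation from cor.test.
quality_p_values <- function(data, target = "quality") {
  predictors <- setdiff(names(data), target)
  sapply(predictors, function(v) cor.test(data[[v]], data[[target]])$p.value)
}

r <- quality_correlations(wine_head)
p <- quality_p_values(wine_head)
significant <- p < 0.05  # TRUE where the correlation passes the 0.05 level
```

Calling cor(wine_head) instead returns the full square matrix of pairwise correlations, which is the form a correlation plot takes as input.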
The heat map plot shows the strong correlation between the two variables and the impact on wine qualities. We see a lot more red dots at the top left of the plot indicating that higher wine quality is related to higher alcohol levels and lower volatile.acidity. Green dots indicating lower wine quality are more present at the bottom of the plot where alcohol content is lower and are also more present where volatile.acidity is higher.
The final plot will examine the mean alcohol content at each wine quality level. It will be interesting to see whether the mean alcohol content rises steadily with quality. The plot will also show how many wines there are at each quality level.
There are some interesting results in the plot. I was not expecting a dip in the alcohol content across the quality levels. Wines with a quality of 3 have a mean alcohol content of about 10, and there are not many of them. Wines with a quality of 4 show a slight uptick in alcohol content, and there are more of them than wines with a quality of 3. Interestingly, the mean alcohol content then drops for wines rated 5, dipping just below 10. There are many more wines at this quality level than at 3 or 4. The alcohol content rises again at quality level 6, to a mean between 10 and 11. The alcohol content is also much more dispersed at this quality level, with a majority of wines between 9 and 13. The number of wines starts to decrease at quality level 7, while the mean alcohol content continues to rise to between 11 and 12. Finally, there are a little over a dozen wines at quality level 8, and their mean alcohol content rises to about 12.
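The means and counts read off the plot can also be tabulated directly. A minimal sketch in base R, using tapply and table on the first ten rows printed by str above; the helper name summarise_by_quality is hypothetical, and those ten rows contain only qualities 5, 6, and 7, so the output illustrates the method rather than the results described above.

```r
# Stand-in data: first ten rows of alcohol and quality from the str() output.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: number of wines and mean alcohol per quality level.
summarise_by_quality <- function(data) {
  means  <- tapply(data$alcohol, data$quality, mean)
  counts <- table(data$quality)
  data.frame(
    quality      = as.integer(names(means)),
    n            = as.integer(counts),
    mean_alcohol = as.numeric(means)
  )
}

by_quality <- summarise_by_quality(wine_head)
by_quality
```

Reporting n alongside each mean matters here because the extreme quality levels contain few wines, so their means are less stable than those for levels 5 and 6.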
As we found out, there isn’t a distinct formula for making great quality wine. Taste is different for every individual. Sweetness makes a better wine for some, while others prefer dry wines. I honestly expected wines that were sweeter to be of higher quality. Through investigation, our dataset indicated a strong positive correlation between alcohol content and the quality of wine. Typically, the higher the alcohol content, the higher the quality of the wine. However, some of the data suggests that this is not always true, and other factors can cause the wine to be of varying quality. If higher alcohol content were equivalent to higher quality wine, then winemakers would simply spike the wine to increase the alcohol content. Other factors such as sulphates and citric acid can influence the taste of wine in a positive way. Factors such as volatile acidity can negatively influence the taste of wine.
We found other factors that can ruin the taste of wine if not handled properly, including chlorides, density, and total sulfur dioxide. Managing those negative influences on wine quality would help winemakers produce better quality wines, but would not necessarily make an award-winning wine. A sample of 1599 red wines is a start for examining the relationship between wine variables and the quality of wine, but further testing and data examination would be necessary before the findings in this study could be considered conclusive.